Urban Mobility Analysis
Literature Review
Our primary source of insight and conceptual foundation was the OECD’s EPIC (Environmental Policies and Individual Behaviour Change) survey [1], conducted across three rounds (2008 to 2022), with an average of 13,000 observations per round and a minimum of 10 developed countries. This study provided insights into household decision-making and transport behaviors, which guided the selection of our parameters and segmentation approach.
Specifically, we followed EPIC’s structure and used socio-demographic variables such as car ownership, household income, and the presence of children as the foundation for our analysis. We also used this study as our reference for segmentation by trip characteristics (e.g., origin-destination regions, purpose). This segmentation gave us more homogeneous subsets, which better capture the effects of the parameters of interest. Here is a sample outcome of the EPIC survey, which we tried to reproduce in the preliminary parts of our study.
Research Statement
This study examines two main aspects related to car usage within households: what factors influence a household’s decision to own a car, and what factors determine how car owners decide to use their car as their primary mode of transportation.
We sought to understand the key drivers behind car ownership and the variables that influence car usage among those who own one.
Data Preparation
Dataset
AllgreD
| Variable | Label | Missing_Percentage |
|---|---|---|
| nbarret | Number_of_Stops_During_Trip | 99.842185 |
| abonpeage | Public_Transport_Pass_Holder | 98.538493 |
| motdeacc | Accompanied_Persons_Purpose_at_Destination | 90.723206 |
| motoracc | Accompanied_Persons_Purpose_at_Origin | 90.699190 |
| NAT_STAT | Parking_Type_Used | 57.407026 |
| NUM_VEH | Vehicle_Number_Used | 54.333059 |
| NB_OCCU | Number_of_Occupants_in_Vehicle | 54.333059 |
| LIEU_STAT | Parking_Location_Used | 54.333059 |
| durstat | Parking_Duration | 54.333059 |
| autoroute | Highway_Usage | 54.333059 |
| prisecharge | Transport_Costs_Covered | 31.182242 |
| ntraj | Number_of_Stops_in_Trip | 27.624537 |
| TPS_MAP_DEP | Walking_Time_At_Origin | 27.507891 |
| ZONE_D_TRAJ | Zone_At_Origin_of_Stop | 27.507891 |
| ZONE_A_TRAJ | Zone_At_Destination_of_Stop | 27.507891 |
| TPS_MAP_ARV | Walking_Time_At_Destination | 27.507891 |
| id_traj | Stop_ID | 27.507891 |
| Couteff | Estimated_Transport_Cost | 22.797448 |
| D12 | Travelled_Distance_As_Crow_Flies | 7.393303 |
| D13 | Travelled_Distance_Declared | 7.393303 |
| zoneres.x.1 | Residential_Zone_Number | 4.315905 |
| motifor | Trip_Purpose_at_Origin | 4.315905 |
| zoneorig | Origin_Zone_of_Trip | 4.315905 |
| heuredep | Departure_Time_Hour | 4.315905 |
| mindep | Departure_Time_Minute | 4.315905 |
| motifdes | Trip_Purpose_at_Destination | 4.315905 |
| zonedest | Destination_Zone_Number | 4.315905 |
| heurearr | Arrival_Time_Hour | 4.315905 |
| minarr | Arrival_Time_Minute | 4.315905 |
| duree | Trip_Duration_Declared | 4.315905 |
| nbmodemec | Number_of_Mechanized_Modes_Used | 4.315905 |
| NO_TRAJ | Trip_Element_Number | 4.315905 |
| mode_V2 | Modified_Transport_Mode | 4.315905 |
| mode_depl_ag | Aggregated_Transport_Mode | 4.315905 |
| DEST | DEST | 4.315905 |
| ORIG | ORIG | 4.315905 |
| mot_o_red | Simplified_Purpose_at_Origin | 4.315905 |
| mot_d_red | Simplified_Purpose_at_Destination | 4.315905 |
| tir | Observation_Drawing_Number | 0.000000 |
| NO_MEN | Household_Number | 0.000000 |
| NO_PERS | Individual_Number_in_Household | 0.000000 |
| NO_DEPL | Trip_Number_for_Individual | 0.000000 |
| id_men | Household_ID | 0.000000 |
| id_pers | Individual_ID | 0.000000 |
| id_depl | Trip_ID | 0.000000 |
| UN | UN | 0.000000 |
AllgreI
| Variable | Label | Missing_Percentage |
|---|---|---|
| situveil | Activity_on_Previous_Day | 90.854922 |
| STAT_TRAV | Parking_Difficulties_at_Workplace | 88.147668 |
| VAL_ABO | Public_Transport_Subscription_Validity | 78.044042 |
| PBM_STAT | General_Parking_Problems | 68.290155 |
| dispovp | Has_Access_to_Private_Vehicle | 56.437824 |
| zonetrav | Work_or_Study_Location_Zone | 37.629534 |
| travdom | Works_or_Studies_at_Home | 35.246114 |
| btt | Total_Daily_Travel_Time | 16.295337 |
| fqvelo | Bicycle_Use_Frequency | 6.878238 |
| FQ2R1 | Motorized_Two_Wheeler_Use_Frequency_Type_1 | 6.878238 |
| FQ2R2 | Motorized_Two_Wheeler_Use_Frequency_Type_2 | 6.878238 |
| fqvpcond | Car_Use_Frequency_as_Driver | 6.878238 |
| fqvppass | Car_Use_Frequency_as_Passenger | 6.878238 |
| freqtcu | Urban_Transport_Use_Frequency | 6.878238 |
| freqtram | Tramway_Use_Frequency | 6.878238 |
| freqrurb | Other_Urban_Transport_Use_Frequency | 6.878238 |
| freqtransisere | Transisere_Transport_Use_Frequency | 6.878238 |
| TEL_PORT | Has_Mobile_Phone | 5.841969 |
| | Has_Email | 5.841969 |
| permis | Has_Driving_License | 5.841969 |
| etabscol | Last_Educational_Institution_Attended | 5.841969 |
| OCCU1 | Main_Occupation | 5.841969 |
| OCCU2 | Secondary_Occupation | 5.841969 |
| csp | Socio_Professional_Category | 5.841969 |
| ABO_TC | Has_Public_Transport_Subscription | 5.841969 |
| freqter | Regional_Train_Use_Frequency | 5.841969 |
| statut2 | Aggregated_Socio_Economic_Status | 5.841969 |
| tir | Observation_Drawing_Number | 0.000000 |
| NO_MEN | Household_Number | 0.000000 |
| NO_PERS | Individual_Number_in_Household | 0.000000 |
| zoneres.y | Residential_Zone_Number | 0.000000 |
| sexe | Gender | 0.000000 |
| lien | Relationship_to_Household_Reference | 0.000000 |
| age | Age | 0.000000 |
| id_men | Household_ID | 0.000000 |
| id_pers | Individual_ID | 0.000000 |
| nbd | Number_of_Trips_Made | 0.000000 |
| UN | Unknown | 0.000000 |
| cspgroup | Grouped_Socio_Professional_Category | 0.000000 |
AllgreM
| Variable | Label | Missing_Percentage |
|---|---|---|
| tir | Household Code | 0.000000 |
| NO_MEN | Household Size | 0.000000 |
| zoneres.x | Residence Area | 0.000000 |
| jourdepl | Day of Travel | 0.000000 |
| TYPE_HAB | Housing Type | 0.000000 |
| TYPE_OCU | Occupancy Type | 0.000000 |
| Gare2 | Dept of Reference SNCF Station | 0.000000 |
| Gare5 | Postal Code of Reference SNCF Station | 0.000000 |
| telefon | Has Telephone | 0.000000 |
| annuaire | Listed in Directory | 9.950556 |
| internet | Has Internet | 0.000000 |
| VP_DISPO | Number of Cars Available | 0.000000 |
| GENRE1 | Type of Car 1 | 13.844252 |
| ENERGIE1 | Fuel Type of Car 1 | 13.844252 |
| AN_VP1 | Year of Car 1 | 13.844252 |
| PUIS_VP1 | Engine Power of Car 1 | 13.844252 |
| POSSES1 | Ownership Status of Car 1 | 13.844252 |
| LIEU_STAT1 | Parking Location of Car 1 | 13.844252 |
| TYPE_STAT1 | Parking Type of Car 1 | 13.844252 |
| GENRE2 | Type of Car 2 | 53.522868 |
| ENERGIE2 | Fuel Type of Car 2 | 53.522868 |
| AN_VP2 | Year of Car 2 | 53.522868 |
| PUIS_VP2 | Engine Power of Car 2 | 53.522868 |
| POSSES2 | Ownership Status of Car 2 | 53.522868 |
| LIEU_STAT2 | Parking Location of Car 2 | 53.522868 |
| TYPE_STAT2 | Parking Type of Car 2 | 53.522868 |
| GENRE3 | Type of Car 3 | 92.119901 |
| ENERGIE3 | Fuel Type of Car 3 | 92.119901 |
| AN_VP3 | Year of Car 3 | 92.119901 |
| PUIS_VP3 | Engine Power of Car 3 | 92.119901 |
| POSSES3 | Ownership Status of Car 3 | 92.119901 |
| LIEU_STAT3 | Parking Location of Car 3 | 92.119901 |
| TYPE_STAT3 | Parking Type of Car 3 | 92.119901 |
| GENRE4 | Type of Car 4 | 98.609394 |
| ENERGIE4 | Fuel Type of Car 4 | 98.609394 |
| AN_VP4 | Year of Car 4 | 98.609394 |
| PUIS_VP4 | Engine Power of Car 4 | 98.609394 |
| POSSES4 | Ownership Status of Car 4 | 98.609394 |
| LIEU_STAT4 | Parking Location of Car 4 | 98.609394 |
| TYPE_STAT4 | Parking Type of Car 4 | 98.609394 |
| NB_velo | Number of Bikes | 0.000000 |
| NB_2Rm | Number of Motorcycles | 0.000000 |
| COEF_MNG | Management Coefficient | 0.000000 |
| id_men | Household ID | 0.000000 |
| id_pers | Individual ID | 0.000000 |
| id_depl | Trip ID | 0.000000 |
| id_traj | Stop ID | 30.438813 |
| nb_pers | Number of People | 0.000000 |
| nbt2 | Total Trips | 13.195303 |
| btt2 | Total Daily Travel Time | 13.195303 |
Steps performed
Select our variables and merge the datasets
Convert the required variables to factors: ORIG, DEST, UN, Area_at_origin_of_stop, travel_mode, zoneorig, zonedest, covered_trip_cost, trip_number, Area_at_destination_of_stop, parking_type, stop_id, mode_V2, highway_used, residence_zone_number, residence_area, id_men, id_pers, id_depl, parking_location, trip_day, housing_type, occupancy_type, POSSES1, POSSES2, POSSES3, POSSES4, dept_sncf_station, postal_sncf_station, socio_category_group, OCCU1, OCCU2, permis, work_zone, car_availability, relationship_status, sexe, public_trans_subscription, parking_problems, socio_category, employment_status, nbmodemec
Create new variables
real_travel_mode: a binary variable with the levels “VP” and “Autre”. The resulting distribution of this variable is: 10953, 16937
Departure_time
Arrival_time
Filter out observations with a missing travel_mode (1258)
Imputing missing values
number_of_stops, Area_at_origin_of_stop, Area_at_destination_of_stop, walk_time_at_destination, walk_time_at_origin, stop_id: if there is no stop, we can replace missing values with “no stop”.
parking_duration and highway_used: if the person did not travel by car, no highway and no parking were used.
car_availability: we used travel_mode, number_of_cars and num_people to impute these values.
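The first two imputation rules above can be sketched as follows. This is a minimal illustration in Python rather than the R used for the analysis. The dictionary layout, the placeholder strings, and the use of a missing stop_id as the signal for "no stop" are assumptions. The car_availability rule is left out because the report does not state how the three helper variables are combined.

```python
# Fields that only make sense when the trip contains a stop.
STOP_FIELDS = [
    "number_of_stops", "Area_at_origin_of_stop", "Area_at_destination_of_stop",
    "walk_time_at_destination", "walk_time_at_origin", "stop_id",
]
# Fields that only make sense for car ("VP") trips, with assumed fill values.
CAR_FIELDS = {"highway_used": "no highway used", "parking_duration": "no parking used"}


def impute_trip(trip):
    """Return a copy of `trip` (a dict) with the rule-based imputation applied.

    Missing values are represented by None (an assumption for this sketch).
    """
    out = dict(trip)
    # Rule 1: no stop recorded -> fill missing stop-related fields with "no stop".
    if out.get("stop_id") is None:
        for field in STOP_FIELDS:
            if out.get(field) is None:
                out[field] = "no stop"
    # Rule 2: trip not made by car -> no highway and no parking were used.
    if out.get("travel_mode") != "VP":
        for field, fill_value in CAR_FIELDS.items():
            if out.get(field) is None:
                out[field] = fill_value
    return out


walk_trip = impute_trip({"travel_mode": "MAP", "stop_id": None,
                         "number_of_stops": None, "highway_used": None,
                         "parking_duration": None})
car_trip = impute_trip({"travel_mode": "VP", "stop_id": "s1",
                        "highway_used": None, "parking_duration": None})
```

Missing values on car trips are deliberately left untouched, since neither rule justifies a fill value for them.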
Deal with trip loops
| id_pers | total_trips | purpose_sequence | mode_sequence |
|---|---|---|---|
| 903116001 | 2 | ACHAT -> DOMICILE | VP->VP |
| 903116002 | 9 | TRAVAIL -> TRAVAIL -> TRAVAIL -> DOMICILE -> ACHAT -> LOISIR -> LOISIR -> LOISIR -> DOMICILE | VP->Autre->Autre->VP->VP->VP->VP->VP->VP |
| 903116003 | 3 | ACHAT -> LOISIR -> LOISIR | MAP->MAP->MAP |
| 903117001 | 7 | ACHAT -> ACHAT -> DOMICILE -> TRAVAIL -> ACCOMPAGNEMENT -> LOISIR -> DOMICILE | MAP->MAP->MAP->VP->VP->VP->VP |
| 903122001 | 4 | ACHAT -> DOMICILE -> LOISIR -> DOMICILE | MAP->MAP->MAP->MAP |
| 903122002 | 2 | LOISIR -> DOMICILE | VP->VP |
Old data
| id_pers | departure_time | travel_mode | mode_switch | mode_group | trip_sequence | total_trips_same_mode | total_crownTravel_distance | total_actualTravel_distance | total_declared_trip_duration | arrival_time | end_arrival_time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 101019001 | 09:45 | MAP | FALSE | 0 | 1 | 4 | 2783 | 3.199 | 48 | 10:00 | 15:58 |
| 101019001 | 10:30 | MAP | FALSE | 0 | 2 | 4 | 2783 | 3.199 | 48 | 10:45 | 15:58 |
| 101019001 | 10:50 | MAP | FALSE | 0 | 3 | 4 | 2783 | 3.199 | 48 | 10:55 | 15:58 |
| 101019001 | 15:45 | MAP | FALSE | 0 | 4 | 4 | 2783 | 3.199 | 48 | 15:58 | 15:58 |
| 101019001 | 16:05 | TCU | TRUE | 1 | 1 | 2 | 7194 | 9.999 | 50 | 16:25 | 17:40 |
| 101019001 | 17:10 | TCU | FALSE | 1 | 2 | 2 | 7194 | 9.999 | 50 | 17:40 | 17:40 |
| 101019001 | 18:00 | MAP | TRUE | 2 | 1 | 4 | 8405 | 9.665 | 145 | 19:07 | 24:05 |
| 101019001 | 19:07 | MAP | FALSE | 2 | 2 | 4 | 8405 | 9.665 | 145 | 20:15 | 24:05 |
| 101019001 | 21:50 | MAP | FALSE | 2 | 3 | 4 | 8405 | 9.665 | 145 | 21:55 | 24:05 |
| 101019001 | 24:00 | MAP | FALSE | 2 | 4 | 4 | 8405 | 9.665 | 145 | 24:05 | 24:05 |
New data
| id_pers | travel_mode | mode_switch | mode_group | trip_sequence | total_trips_same_mode | total_crownTravel_distance | total_actualTravel_distance | total_declared_trip_duration | departure_time | arrival_time | end_arrival_time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 101019001 | MAP | FALSE | 0 | 1 | 4 | 2783 | 3.199 | 48 | 09:45 | 10:00 | 15:58 |
| 101019001 | TCU | TRUE | 1 | 1 | 2 | 7194 | 9.999 | 50 | 16:05 | 16:25 | 17:40 |
| 101019001 | MAP | TRUE | 2 | 1 | 4 | 8405 | 9.665 | 145 | 18:00 | 19:07 | 24:05 |
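The step from "Old data" to "New data" can be sketched as a run-length grouping: consecutive trips made with the same mode are merged into one row. The sketch below is in Python rather than R, and the tuple layout is an assumption. It computes the total duration as arrival minus departure, which reproduces the 48, 50 and 145 minutes shown for this person; the original analysis may instead sum the declared duration variable.

```python
from itertools import groupby

# Trips of person 101019001 from the "Old data" table: (departure, mode, arrival).
TRIPS = [
    ("09:45", "MAP", "10:00"), ("10:30", "MAP", "10:45"),
    ("10:50", "MAP", "10:55"), ("15:45", "MAP", "15:58"),
    ("16:05", "TCU", "16:25"), ("17:10", "TCU", "17:40"),
    ("18:00", "MAP", "19:07"), ("19:07", "MAP", "20:15"),
    ("21:50", "MAP", "21:55"), ("24:00", "MAP", "24:05"),
]


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes (hours above 23 are allowed)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def collapse_same_mode(trips):
    """Merge consecutive trips with the same mode into a single row.

    `trips` must already be sorted by departure time for one person.
    """
    collapsed = []
    runs = groupby(trips, key=lambda trip: trip[1])
    for mode_group, (mode, run) in enumerate(runs):
        run = list(run)
        collapsed.append({
            "travel_mode": mode,
            "mode_switch": mode_group > 0,  # True when the mode changed
            "mode_group": mode_group,
            "total_trips_same_mode": len(run),
            "total_declared_trip_duration": sum(
                to_minutes(arrival) - to_minutes(departure)
                for departure, _, arrival in run
            ),
            "departure_time": run[0][0],
            "arrival_time": run[0][2],
            "end_arrival_time": run[-1][2],
        })
    return collapsed


COLLAPSED = collapse_same_mode(TRIPS)
```

Grouping on consecutive runs, rather than on the mode alone, keeps the two separate MAP sequences of the day apart, as in the "New data" table.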
Total Trips with Same Mode
The analysis reveals a strong positive relationship between the number of consecutive trips made with the same mode and the percentage of car users. Specifically, when individuals anticipate making multiple trips consecutively, the likelihood of choosing a car increases significantly. The percentage of car users rises from 38% for a single trip to its peak value as the total trips made with the same mode approach 16. This trend underscores the preference for cars in scenarios involving repeated, consistent travel.
We then filtered out sequences of more than 13 consecutive car trips, because these probably correspond to taxi drivers, whom we do not want to include in our analysis.
Insights
Travel Distance and Duration Analysis
The analysis of travel distance highlights that MAP covers the shortest distances, which is expected given its localized nature. VP distances fall between TCU and TCIU, aligning with the car’s flexibility for both urban and intercity travel. Interestingly, trips classified under “Autre” also cover substantial distances, indicating diverse long-distance usage. For travel duration, despite the longer distances, car trips exhibit shorter durations compared to TCU, TCIU, and “Autre.” This suggests a preference for cars when time is a critical factor.
Location Effect on travel mode
Distribution of the travel mode by residence area
inside_the_city outside_the_city
5962 4079
The analysis indicates that car usage is evenly distributed among individuals living inside and outside the city. In contrast, other travel modes from the dataset are predominantly used within the city. This suggests that while cars offer consistent utility regardless of location, alternative modes cater more to urban travel needs.
Distribution of travel_mode by trip type
city_to_city inside_outside outside_the_city Within_the_city
319 1591 2404 5727
The analysis reveals a significant disparity in trip types, with the majority of observations corresponding to trips made within the same city, such as Grenoble, Voiron, St Marcellin, or La Touvet. Alternative modes of transport are predominantly used for within-city travel, whereas cars dominate for intercity and city-to-outside trips. This highlights the reliance on cars for longer and less localized travel.
Effect of distance and duration
Analyzing distance and duration alongside trip types confirms that shorter distances correlate with trips within or just outside the city, while longer distances are associated with intercity or city-to-outside trips—an expected pattern. Within-city and outside-city trips show a preference for cars when distances increase, as they also minimize travel time. For inside-outside or city-to-city trips, the travel distance remains similar regardless of the mode, which is logical given the fixed nature of such routes. However, car users consistently achieve these trips in less time, reinforcing the car’s efficiency for longer travel.
Factors Influencing Household Car Ownership
In this analysis, we examined several key factors influencing car ownership, using variables from a research paper on the same topic. These factors include the number of children and adults in the household, urban area, income, and whether the individual lives alone.
To do so, we created several new variables: Age_group,
categorized as teenagers, young adults, adults, and seniors based on the
age variable; Urban_area, defined as major city (for
Grenoble or Voiron), suburb (for St Marcellin or La Touvet), and rural
(for locations outside any city); and Income, derived from
the combination of the csp variable (representing profession) and occu1
and occu2 (indicating employment status such as full-time, part-time, or
student). These new variables were essential in understanding the
factors influencing car ownership.
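A minimal sketch of the Urban_area derivation described above, written in Python rather than R. The function names and the city-name input are hypothetical; how the original analysis maps residence zones to cities is not shown in the report. The senior flag uses the ">55 years" threshold given later in the report; the cut-offs of the other age groups are not stated, so they are not reproduced here.

```python
MAJOR_CITIES = {"Grenoble", "Voiron"}
SUBURBS = {"St Marcellin", "La Touvet"}


def urban_area(city):
    """Classify a residence as 'Major city', 'Suburb' or 'Rural'.

    `city` is the name of the city the household lives in, or None when
    the household lives outside any city (assumed input format).
    """
    if city in MAJOR_CITIES:
        return "Major city"
    if city in SUBURBS:
        return "Suburb"
    return "Rural"


def is_senior(age):
    """Senior flag, using the '>55 years' definition given in the report."""
    return age > 55


areas = [urban_area(c) for c in ["Grenoble", "St Marcellin", None]]
```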
% Difference table
| category | group | percent_change_from_benchmark | chisq.test_p_value |
|---|---|---|---|
| Number of Children (Benchmark: No Children in the household) | One Child | | 8.43e-15 |
| Number of Children (Benchmark: No Children in the household) | Two Children | | 8.43e-15 |
| Number of Children (Benchmark: No Children in the household) | Three or more Children | | 8.43e-15 |
| Number of Adults (Benchmark: 1 adult in the household) | Three or more adults | | 6.53e-69 |
| Number of Adults (Benchmark: 1 adult in the household) | Two adults | | 6.53e-69 |
| Urban Area (Benchmark: Major City) | Suburb | | 1.05e-29 |
| Urban Area (Benchmark: Major City) | Rural | | 1.05e-29 |
| Income Group (Benchmark: No income) | High income | | 3.94e-96 |
| Income Group (Benchmark: No income) | Low income | | 3.94e-96 |
| Income Group (Benchmark: No income) | Medium income | | 3.94e-96 |
| Living alone (Benchmark: No) | 1 | -22.29 % | 6.02e-72 |
The data shows a clear trend: as the number of children increases, the likelihood of car ownership rises, with a 9.5% increase for one child, 11.8% for two, and 13.9% for three or more. Similarly, the percentage of households owning a car increases by 21% as the number of adults grows. Urban area also plays a role, with car ownership higher in suburban (9.65%) and rural (14.3%) areas compared to major cities. Income follows the same pattern: higher household income correlates with increased car ownership. Additionally, individuals living alone are less likely to own a car (-22.29%). Importantly, the chi-square test p-values are all far below 0.05, confirming a statistically significant association between each of these variables and whether a household owns a car, which reinforces the observed patterns.
Predictive Analysis of Car Ownership
- First Model
Train Data:
0 1
290 2132
We can see that the data is imbalanced. To address this, we apply the SMOTE technique:
Smote Data:
0 1
870 1160
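As a quick arithmetic check, the class counts before and after SMOTE are consistent with an over-sampling rate of 200% and an under-sampling rate of 200%, under the convention documented for `DMwR::SMOTE` (`perc.over`, `perc.under`). Both the rates and the choice of implementation are inferred from the counts; the report does not state them. The Python function below is a hypothetical helper that only reproduces the counting, not the synthetic-sample generation.

```python
def smote_counts(n_minority, perc_over, perc_under):
    """Class sizes after SMOTE, following the perc.over / perc.under convention.

    perc_over:  synthetic minority cases generated, as a % of the minority size.
    perc_under: majority cases kept, as a % of the synthetic cases generated.
    """
    synthetic = n_minority * perc_over // 100
    minority_total = n_minority + synthetic
    majority_kept = synthetic * perc_under // 100
    return minority_total, majority_kept


# 290 households without a car in the training data; rates are assumptions.
minority_total, majority_kept = smote_counts(290, perc_over=200, perc_under=200)
```

Under these assumed rates the function returns 870 and 1160, the two counts reported for the SMOTE data.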
Call:
randomForest(formula = have_car ~ living_alone + number_of_adults_group + +have_alternatives + number_of_children + income_value + urban_area, data = smote_train, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 19.21%
Confusion matrix:
0 1 class.error
0 724 146 0.1678161
1 244 916 0.2103448
The random forest model’s out-of-bag (OOB) error rate is 19.21%,
meaning approximately 80.79% of out-of-bag predictions were correctly
classified. Upon analyzing the variable importance, we found that the four
most important factors are number_of_adults, followed by
the household income, whether the individual is
living_alone, and urban_area. To
optimize the model, we can try removing the least important
variables.
- Confusion Matrix Metrics
Accuracy of the model: 0.7433775
Confusion Matrix: Test Set:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 51 134
1 21 398
Accuracy : 0.7434
95% CI : (0.7066, 0.7778)
No Information Rate : 0.8808
P-Value [Acc > NIR] : 1
Kappa : 0.2719
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.70833
Specificity : 0.74812
Pos Pred Value : 0.27568
Neg Pred Value : 0.94988
Prevalence : 0.11921
Detection Rate : 0.08444
Detection Prevalence : 0.30629
Balanced Accuracy : 0.72823
'Positive' Class : 0
The model shows an accuracy of 74.34%, but the class distribution is strongly imbalanced: the No Information Rate (NIR) is higher at 88.08%, indicating that always predicting the majority class would be more accurate than the model. The sensitivity for Class 0 is 70.83%, which is good, but the positive predictive value (PPV) is quite low at 27.57%, meaning that most of the Class 0 predictions are incorrect. The model’s balanced accuracy of 72.82% indicates that improvements are needed, especially in predicting Class 0, potentially through techniques like class balancing or model adjustments.
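The statistics in the test-set output can be recomputed from the four cells of the confusion matrix. The sketch below is plain Python, not the caret code that produced the output, and the function and argument names are hypothetical. Class 0 is treated as the positive class, as in the printed output.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics for a 2x2 confusion matrix (positive class = Class 0).

    tp: predicted 0, reference 0    fp: predicted 0, reference 1
    fn: predicted 1, reference 0    tn: predicted 1, reference 1
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # No Information Rate: share of the most frequent reference class.
        "nir": max(tp + fn, fp + tn) / total,
    }


# Cells taken from the test-set confusion matrix of the first model.
metrics = binary_metrics(tp=51, fp=134, fn=21, tn=398)
```

Comparing `accuracy` with `nir` makes the weakness explicit: the model is less accurate than a rule that always predicts car ownership.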
Factors Influencing Car Usage Among Car Owners
[Figure: pie charts of travel mode share by trip purpose. Panels: Full Sample, Car owners, Non car owners]
The pie charts reveal that car usage is the most prevalent mode of transport across all three trip purposes, with the highest percentage for commuting (home to work), followed by leisure (shopping, leisure, and accompaniment), and lastly for education. Walking is also a notable mode for both education and leisure trips.
Regular access to a car emerges as the key determinant of transport mode choice, reflected in the marked differences between those with and without access to a car. However, car access alone does not fully explain mode choice, so we explore additional factors by comparing respondents from households with car access that differ in other characteristics.
Urban Area
% Difference in Rural Areas
In rural areas, car usage has significantly risen for all trip types—commuting, leisure, and educational—compared to the major city, with increases of 12.1%, 4.3%, and 11.7% respectively. This shift to car usage in rural areas has mostly replaced TCU and MAP. For educational trips, the decline in TCU and MAP is not only offset by cars but also by TCIU and other modes of transportation. Overall, these findings show that moving away from urban areas leads to a greater reliance on cars for all trip types, with the most significant changes occurring in rural areas, especially for leisure and commuting activities.
% Difference in Suburb Areas
In suburban areas, car usage is higher than in the major city, especially for commuting and leisure trips, with increases of 6.9% and 9.1% respectively. For educational trips, however, the share of car usage decreases. The use of TCIU also rises for commuting and educational trips, more noticeably for educational trips. The growth in car usage largely replaces walking (MAP), which decreases for all trip purposes. Notably, the increase in car usage for leisure activities has primarily replaced TCU.
Chi-Squared Analysis
Chi-squared test of urban areas and travel mode for commuting:
Pearson's Chi-squared test
data: table_commuting
X-squared = 258.4, df = 8, p-value < 2.2e-16
Chi-squared test of urban areas and travel mode for education:
Pearson's Chi-squared test
data: table_education
X-squared = 248.9, df = 8, p-value < 2.2e-16
Chi-squared test of urban areas and travel mode for leisure:
Pearson's Chi-squared test
data: table_leisure
X-squared = 275.87, df = 8, p-value < 2.2e-16
The results of the Chi-squared tests for the relationship between urban area and travel mode across different trip purposes (commuting, education, and leisure) show a highly significant association, with p-values far below 0.05. Urban area and travel mode choice are therefore not independent. These findings align with the previous analysis, where we observed that car usage increases in suburban and rural areas compared to major cities.
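For reference, the Pearson statistic and its degrees of freedom can be computed by hand from a contingency table. The sketch below is plain Python, not the R `chisq.test` call that produced the output, and the counts are hypothetical, since the report does not show the underlying tables. It applies no continuity correction. With three urban areas and five travel modes, the degrees of freedom are (3 - 1) × (5 - 1) = 8, matching the output above.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    `table` is a list of rows of observed counts; no continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows = Major city, Suburb, Rural;
# columns = Autre, MAP, TCIU, TCU, VP.
example_table = [
    [120, 300, 40, 200, 500],
    [30, 80, 25, 40, 260],
    [20, 60, 30, 20, 310],
]
statistic, df = chi_squared(example_table)
```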
Income influence
High Income Low Income Medium Income No Income
Commuting 714 427 1344 433
Education 1 4 0 1236
Leisure 477 2350 1065 714
The analysis reveals a significant increase in car usage as income rises, particularly for commuting trips, where high-income individuals show more than a 20% increase in car usage compared to No-income individuals. This increase mainly replaced trips made by MAP, but also to a lesser extent TCU and TCIU. For leisure activities, there is a notable rise in car usage with increasing income, along with a surprising increase in MAP usage as well. Additionally, for leisure trips, higher income leads to a decrease in the usage of other transport modes (TCU, TCIU, Autres).
We did not run this analysis for education trips, since 99% of the individuals who are still studying have no income.
Chi-Squared Analysis
Chi-squared test of income and travel mode for commuting:
Pearson's Chi-squared test
data: table_commuting
X-squared = 225.56, df = 12, p-value < 2.2e-16
Chi-squared test of income and travel mode for leisure:
Pearson's Chi-squared test
data: table_leisure
X-squared = 11.743, df = 8, p-value = 0.1631
The Chi-squared test results confirm that income significantly influences car usage for commuting purposes, with a p-value below 0.05, supporting the previous analysis that income impacts travel mode choices for these trips. For leisure trips, however, the test yielded a p-value of 0.1631, indicating no significant relationship between income and travel mode, which contradicts the earlier analysis suggesting income might influence car usage for leisure. Thus, while income affects car usage for commuting, it does not significantly impact leisure trip choices; education trips were not tested.
Distance Influence
The analysis shows a substantial increase in car usage, with a rise of
up to 50% for both leisure and commuting trips as distances increase
from short to long. There is also a slight uptick in the use of
other modes, such as TCU, TCIU, and Autres, which replace MAP, whose
share decreases by more than 80%. For education trips, car usage also
increases with distance, though not as significantly as for commuting
and leisure. Interestingly, for these trips, car usage decreases between
medium and long distances, primarily replaced by TCIU. For all three trip
purposes, TCU usage declines as distance grows, which is logical, as
buses and trams become less practical for longer journeys.
Chi-Squared Analysis
Chi-squared test of distance and travel mode for commuting:
Pearson's Chi-squared test
data: table_commuting
X-squared = 1553.3, df = 8, p-value < 2.2e-16
Chi-squared test of distance and travel mode for education:
Pearson's Chi-squared test
data: table_education
X-squared = 2287.4, df = 8, p-value < 2.2e-16
Chi-squared test of distance and travel mode for leisure:
Pearson's Chi-squared test
data: table_leisure
X-squared = 841.8, df = 8, p-value < 2.2e-16
The results of the Chi-squared tests for the relationship between distance category (short, medium, or long) and travel mode for commuting, education, and leisure trips all show a highly significant association, with p-values much smaller than 0.05. This confirms that the distance category has a statistically significant influence on travel mode choices across different trip purposes, consistent with the observation that longer distances are associated with a higher reliance on cars.
Extra variables
| category | travel_mode | group | percent_diff | chi_squared_p_value |
|---|---|---|---|---|
| sexe(Benchmark: Male) | Autre | 2 | -3.4781354 | 4.42e-14 |
| sexe(Benchmark: Male) | MAP | 2 | 3.5519556 | 4.42e-14 |
| sexe(Benchmark: Male) | TCIU | 2 | 0.4685953 | 4.42e-14 |
| sexe(Benchmark: Male) | TCU | 2 | 2.4293033 | 4.42e-14 |
| sexe(Benchmark: Male) | VP | 2 | -2.9717188 | 4.42e-14 |
| Senior(Benchmark :NO) | Autre | 1 | -6.1290951 | 8.80e-27 |
| Senior(Benchmark :NO) | MAP | 1 | 4.6538475 | 8.80e-27 |
| Senior(Benchmark :NO) | TCIU | 1 | -2.2187492 | 8.80e-27 |
| Senior(Benchmark :NO) | TCU | 1 | -0.2084964 | 8.80e-27 |
| Senior(Benchmark :NO) | VP | 1 | 3.9024932 | 8.80e-27 |
| Has Permit to drive(Benchmark :Yes) | Autre | 0 | 6.8671246 | 1.12e-109 |
| Has Permit to drive(Benchmark :Yes) | MAP | 0 | 6.0395404 | 1.12e-109 |
| Has Permit to drive(Benchmark :Yes) | TCIU | 0 | 4.5943309 | 1.12e-109 |
| Has Permit to drive(Benchmark :Yes) | TCU | 0 | 4.7697513 | 1.12e-109 |
| Has Permit to drive(Benchmark :Yes) | VP | 0 | -22.2707471 | 1.12e-109 |
| Have alternative(Benchmark :NO) | Autre | 1 | 6.5235761 | 3.73e-20 |
| Have alternative(Benchmark :NO) | MAP | 1 | -4.3218818 | 3.73e-20 |
| Have alternative(Benchmark :NO) | TCIU | 1 | 1.5134869 | 3.73e-20 |
| Have alternative(Benchmark :NO) | TCU | 1 | -2.7685434 | 3.73e-20 |
| Have alternative(Benchmark :NO) | VP | 1 | -0.9466378 | 3.73e-20 |
| Student(Benchmark :NO) | Autre | 1 | 7.8308608 | 3.41e-105 |
| Student(Benchmark :NO) | MAP | 1 | 3.9088550 | 3.41e-105 |
| Student(Benchmark :NO) | TCIU | 1 | 4.4446268 | 3.41e-105 |
| Student(Benchmark :NO) | TCU | 1 | 4.8092633 | 3.41e-105 |
| Student(Benchmark :NO) | VP | 1 | -20.9936059 | 3.41e-105 |
| Living Alone(Benchmark :NO) | Autre | 1 | -2.3387713 | 7.36e-09 |
| Living Alone(Benchmark :NO) | MAP | 1 | 3.6393719 | 7.36e-09 |
| Living Alone(Benchmark :NO) | TCIU | 1 | -1.9057719 | 7.36e-09 |
| Living Alone(Benchmark :NO) | TCU | 1 | 2.6062184 | 7.36e-09 |
| Living Alone(Benchmark :NO) | VP | 1 | -2.0010471 | 7.36e-09 |
The analysis of mode of transport usage reveals several trends based on key demographic and lifestyle factors.
Sex: The differences are small but statistically significant. Compared to males, females (group 2) are slightly less likely to use cars (-3.0%) and “Autre” modes (-3.5%), and more likely to walk (MAP, +3.6%) or use TCU (+2.4%). This could be linked to social and cultural factors that shape how women travel for commuting or family-related activities.
Senior vs Non-Senior: Seniors (>55 years) exhibit higher car and MAP usage, while other modes, especially “Autres,” show a significant decrease of 6.1%. Older adults may prefer the convenience and comfort of cars or MAP due to mobility challenges or a lack of access to public transportation. They may also be less inclined to use public transport for long distances or during non-peak hours. The large 6.1% decrease in “Autres” usage for seniors is likely due to limited use of alternatives like bicycles or motorcycles, which are less practical or less accessible for this age group.
Not Having a Driving Permit: Individuals without a driving permit are 22% less likely to use a car. This is logical, as people without a driving permit cannot drive a car, thus reducing their chances of choosing this mode of transport. This group is more likely to rely on alternative modes of transport, including walking, public transport, or other forms of mobility.
Alternatives (Motorcycle or Bicycle): Individuals with alternative transport modes show a large increase in “Autres” usage (+6.5%) and a decrease in MAP (-4.3%). Car usage slightly decreases. Individuals with alternative transport modes such as motorcycles or bicycles often use these for short trips or leisure, resulting in increased usage of “Autres” (alternative modes). The decrease in MAP and TCU usage makes sense, as people with alternatives are less likely to walk or use urban public transport. The slight decrease in car usage suggests that while alternatives offer a viable option, they don’t fully replace the car for longer or more essential trips.
Living Alone: Those living alone are more likely to walk or use TCU, and less likely to use cars or other transport modes. Without the need to coordinate travel with family or friends, they might prefer more flexible, accessible modes like walking or TCU.
Importantly, the results of the Chi-squared tests for all these factors show very low p-values, indicating that there is a statistically significant association between these variables (sex, senior status, driving permit, alternative transport modes, student status, and living alone) and the choice of travel mode. This suggests that the observed differences in travel behavior are not due to chance, reinforcing the trends identified in the descriptive analysis.
Machine Learning
Target to predict: real_travel_mode, based on has_car, trip_category, distance_category, urban_area, income, living_alone, age_group, has_alternatives, sexe, permis
tibble [9,617 × 11] (S3: tbl_df/tbl/data.frame)
$ real_travel_mode : Factor w/ 2 levels "Autre","VP": 1 2 1 1 1 1 1 1 1 1 ...
$ has_car : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
$ trip_category : Factor w/ 3 levels "Commuting","Education",..: 3 1 3 3 1 1 3 3 3 3 ...
$ distance_category: Factor w/ 3 levels "Long","Medium",..: 3 1 2 2 2 2 2 2 3 2 ...
$ urban_area : Factor w/ 3 levels "Major city","Rural",..: 1 1 1 1 1 1 1 1 1 1 ...
$ income : Factor w/ 4 levels "High Income",..: 2 1 3 3 3 3 3 3 3 3 ...
$ living_alone : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
$ age_group : Factor w/ 4 levels "Adults","Seniors",..: 4 1 4 4 4 4 4 4 4 4 ...
$ has_alternatives : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ sexe : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 2 2 2 2 ...
$ permis : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
[1] "train_data"
Autre VP
3338 3395
[1] "test_data"
Autre VP
1430 1454
Random Forest model
Call:
randomForest(formula = real_travel_mode_binary ~ . - real_travel_mode, data = trainData, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 22.87%
Confusion matrix:
0 1 class.error
0 2396 942 0.2822049
1 598 2797 0.1761414
The random forest model shows an overall Out-of-Bag (OOB) error rate
of 22.87%, estimated for each observation from the trees that did not
see it during training. The confusion matrix indicates that for Class 0
(Autre), 2,396 instances were correctly classified while 942 were
misclassified, a class error rate of 28.22%. For Class 1 (VP), the model
performed better, correctly classifying 2,797 instances with 598
misclassifications, a lower class error rate of 17.61%. The model is
therefore more effective at predicting Class 1 than Class 0. Because the
two classes are almost equally frequent in the training data, this gap
reflects asymmetric errors rather than unequal class sizes; class
weighting or hyperparameter tuning may help to reduce it.
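The error rates in the randomForest printout can be re-derived from its confusion matrix. The following is a small sketch in Python; the helper `class_error` is a name introduced here for illustration.

```python
# OOB confusion matrix from the randomForest printout:
# outer key = true class, inner key = OOB-predicted class.
oob = {0: {0: 2396, 1: 942},
       1: {0: 598, 1: 2797}}

def class_error(matrix, cls):
    """Share of observations of class `cls` that were assigned to another class."""
    row = matrix[cls]
    wrong = sum(count for pred, count in row.items() if pred != cls)
    return wrong / sum(row.values())

total = sum(sum(row.values()) for row in oob.values())
misclassified = oob[0][1] + oob[1][0]
oob_error = misclassified / total

print(f"OOB error rate : {oob_error:.2%}")
print(f"class 0 error  : {class_error(oob, 0):.2%}")
print(f"class 1 error  : {class_error(oob, 1):.2%}")
```

The overall OOB error is the sum of the off-diagonal counts divided by the 6,733 training rows, and each class error is the off-diagonal count of a row divided by that row's total.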
Confusion Matrix Train Set:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2396 598
1 942 2797
Accuracy : 0.7713
95% CI : (0.7611, 0.7813)
No Information Rate : 0.5042
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5421
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7178
Specificity : 0.8239
Pos Pred Value : 0.8003
Neg Pred Value : 0.7481
Prevalence : 0.4958
Detection Rate : 0.3559
Detection Prevalence : 0.4447
Balanced Accuracy : 0.7708
'Positive' Class : 0
Confusion Matrix: Test Set:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1022 240
1 408 1214
Accuracy : 0.7753
95% CI : (0.7596, 0.7904)
No Information Rate : 0.5042
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5502
Mcnemar's Test P-Value : 5.367e-11
Sensitivity : 0.7147
Specificity : 0.8349
Pos Pred Value : 0.8098
Neg Pred Value : 0.7485
Prevalence : 0.4958
Detection Rate : 0.3544
Detection Prevalence : 0.4376
Balanced Accuracy : 0.7748
'Positive' Class : 0
The confusion matrices on the train and test sets show consistent
performance, with accuracy of 77.13% on the train set and 77.53% on the
test set. Sensitivity and specificity are also similar: 71.47% and
83.49% on the test set, against 71.78% and 82.39% on the train set. The
train-set counts coincide with the OOB confusion matrix reported for the
model, which suggests that the train figures are OOB-based; either way,
the minimal differences between the two sets indicate no significant
overfitting, and the model generalizes well to unseen data. Because the
positive class is 0, specificity is the recall for Class 1: the model
recovers Class 1 trips more often than Class 0 trips, although its
Class 0 predictions are more often correct (positive predictive value
80.98% against a negative predictive value of 74.85% on the test set).
There is therefore room for improvement in detecting Class 0; class
weighting or hyperparameter tuning may be beneficial.
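The statistics printed under each confusion matrix follow directly from its four cells. As a sketch, the Python function below (its name and argument names are chosen here for illustration) recomputes them for the random forest test set, treating class 0 as the positive class as the printout does.

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics with class 0 as the positive class, as in the printed output.

    tp: reference 0 predicted 0    fn: reference 0 predicted 1
    fp: reference 1 predicted 0    tn: reference 1 predicted 1
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Cells of the random forest test-set confusion matrix.
rf_test = binary_metrics(tp=1022, fn=408, fp=240, tn=1214)
for name, value in rf_test.items():
    print(f"{name:18s} {value:.4f}")
```

Kappa rescales accuracy by the agreement expected from the marginal totals alone, which is close to 0.5 here because both the reference classes and the predictions are nearly balanced.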
Logit model
Call:
glm(formula = real_travel_mode_binary ~ . - real_travel_mode,
family = "binomial", data = trainData)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -17.51731 201.16326 -0.087 0.9306
has_car1 18.69323 201.16320 0.093 0.9260
trip_categoryEducation -0.87061 0.12083 -7.205 5.80e-13 ***
trip_categoryLeisure -0.01951 0.07614 -0.256 0.7977
distance_categoryMedium -0.40237 0.07704 -5.223 1.76e-07 ***
distance_categoryShort -2.44829 0.09592 -25.523 < 2e-16 ***
urban_areaRural 0.32248 0.06458 4.993 5.94e-07 ***
urban_areaSuburb 0.02571 0.14468 0.178 0.8590
incomeLow Income -0.23742 0.11557 -2.054 0.0400 *
incomeMedium Income 0.10062 0.10538 0.955 0.3397
incomeNo Income 0.21868 0.19636 1.114 0.2654
living_alone1 -0.29387 0.09366 -3.138 0.0017 **
age_groupSeniors 0.03795 0.09563 0.397 0.6915
age_groupTeenagers -0.60273 0.24333 -2.477 0.0133 *
age_groupYoung Adults -0.06836 0.09806 -0.697 0.4857
has_alternatives1 -0.02505 0.08630 -0.290 0.7716
sexe2 -0.09590 0.06215 -1.543 0.1228
permis1 0.24419 0.17206 1.419 0.1559
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9333.4 on 6732 degrees of freedom
Residual deviance: 6596.0 on 6715 degrees of freedom
AIC: 6632
Number of Fisher Scoring iterations: 17
Start: AIC=6631.96
real_travel_mode_binary ~ (real_travel_mode + has_car + trip_category +
distance_category + urban_area + income + living_alone +
age_group + has_alternatives + sexe + permis) - real_travel_mode
Df Deviance AIC
- has_alternatives 1 6596.0 6630.0
- permis 1 6598.0 6632.0
<none> 6596.0 6632.0
- sexe 1 6598.3 6632.3
- age_group 3 6602.6 6632.6
- living_alone 1 6605.7 6639.7
- income 3 6610.4 6640.4
- urban_area 2 6621.6 6653.6
- trip_category 2 6655.8 6687.8
- has_car 1 7423.9 7457.9
- distance_category 2 7571.5 7603.5
Step: AIC=6630.04
real_travel_mode_binary ~ has_car + trip_category + distance_category +
urban_area + income + living_alone + age_group + sexe + permis
Df Deviance AIC
- permis 1 6598.0 6630.0
<none> 6596.0 6630.0
- sexe 1 6598.4 6630.4
- age_group 3 6602.9 6630.9
- living_alone 1 6605.9 6637.9
- income 3 6610.4 6638.4
- urban_area 2 6621.6 6651.6
- trip_category 2 6655.9 6685.9
- has_car 1 7425.4 7457.4
- distance_category 2 7571.5 7601.5
Step: AIC=6630
real_travel_mode_binary ~ has_car + trip_category + distance_category +
urban_area + income + living_alone + age_group + sexe
Df Deviance AIC
<none> 6598.0 6630.0
- sexe 1 6600.8 6630.8
- living_alone 1 6607.0 6637.0
- income 3 6613.2 6639.2
- age_group 3 6619.7 6645.7
- urban_area 2 6624.0 6652.0
- trip_category 2 6657.5 6685.5
- distance_category 2 7573.8 7601.8
- has_car 1 7615.3 7645.3
The stepwise selection drops has_alternatives and then permis, lowering the AIC from 6631.96 to 6630.04 and then to 6630.00; no further removal reduces the AIC. The deletion tables show that the most important predictors are distance_category and has_car: in the final step, removing either one raises the AIC by roughly 1,000 points (to 7601.8 and 7645.3). trip_category and urban_area come next (AIC 6685.5 and 6652.0), followed by age_group, income, and living_alone, whose removal raises the AIC by 7 to 16 points. sexe is borderline, as removing it raises the AIC by less than one point. The final model is therefore slightly simpler than the full model, with a negligibly higher deviance (6598.0 against 6596.0).
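The AIC column of the stepwise tables can be reproduced from the deviance column. The sketch below assumes the standard identity for a logit model fitted to 0/1 responses, AIC = deviance + 2 x (number of coefficients); the function name and the selection of rows are choices made for this illustration.

```python
def aic_binary_logit(deviance, n_params):
    # For 0/1 responses the saturated log-likelihood is zero, so
    # deviance = -2 * logLik and AIC = deviance + 2 * (number of coefficients).
    return deviance + 2 * n_params

FULL_PARAMS = 18  # coefficient rows in the printed summary, intercept included

# (term removed, Df of the term, deviance after removal), from the first step table.
first_step = [
    ("<none>", 0, 6596.0),
    ("has_alternatives", 1, 6596.0),
    ("permis", 1, 6598.0),
    ("has_car", 1, 7423.9),
    ("distance_category", 2, 7571.5),
]

aic = {term: aic_binary_logit(dev, FULL_PARAMS - df) for term, df, dev in first_step}
for term, value in sorted(aic.items(), key=lambda kv: kv[1]):
    print(f"{term:18s} AIC = {value:.1f}")
```

Dropping a term is worthwhile only when the 2 points saved per coefficient exceed the rise in deviance, which is why has_alternatives (no visible change in deviance) is removed first while has_car and distance_category are never candidates.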
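The overall significance test for the model gives a p-value below 2e-16, consistent with the drop from the null deviance (9333.4 on 6732 degrees of freedom) to the residual deviance (6596.0 on 6715 degrees of freedom): the predictors jointly improve on the intercept-only model by far more than chance would allow. This establishes significance rather than goodness of fit. The very large negative intercept (-17.5) and has_car1 coefficient (+18.7), both with standard errors near 201, together with several insignificant predictors, indicate that the specification could be refined further.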
For the variables:
Strong Negative Influences: Short trips and education-related trips significantly reduce the likelihood of car usage relative to long trips and commuting trips. Short trips have by far the strongest effect (coefficient -2.45). Teenagers also show a reduced likelihood of car usage relative to adults (-0.60, significant at the 5% level).
Strong Positive Influence: The presence of a car (has_car1) has the largest positive coefficient, +18.69, which suggests that having a car greatly increases the likelihood of using it, although its standard error (201.16) is too large for the magnitude to be interpreted literally. Living in a rural area also has a positive, significant effect (+0.32), indicating a higher likelihood of car usage than in major cities.
Moderate Effects: Medium-distance trips (-0.40) and low-income individuals (-0.24) moderately reduce the likelihood of car usage relative to long trips and high-income individuals. Medium-income individuals show a slight increase (+0.10), but it is not statistically significant.
Minimal Impact: Leisure trips (trip_categoryLeisure) and living in suburban areas (urban_areaSuburb) have coefficients close to zero (-0.02 and +0.03) that are not statistically significant, so they have little to no impact on car usage compared to the more influential factors.
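Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The short Python sketch below does this for the significant, well-estimated coefficients of the summary; has_car1 is left out on purpose because its standard error makes the point estimate unreliable.

```python
import math

# Estimates copied from the glm summary (log-odds scale).
coefficients = {
    "distance_categoryShort": -2.44829,
    "trip_categoryEducation": -0.87061,
    "distance_categoryMedium": -0.40237,
    "living_alone1": -0.29387,
    "urban_areaRural": 0.32248,
}

# exp(beta) is the multiplicative change in the odds of car use
# relative to the reference level, other variables held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:26s} odds ratio = {ratio:.3f}")
```

Read this way, a short trip has under a tenth of the odds of car use of a long trip, while a rural residence multiplies the odds by about 1.4 compared with a major city.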
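The variable has_car1 is not statistically significant according to its
Wald test (p-value = 0.926 > 0.05). However, its very large coefficient
and standard error (18.69 and 201.16) are typical of quasi-complete
separation, which would arise if individuals without a car almost never
travel by car in the training data; in that situation the Wald test is
unreliable. The stepwise selection, which compares deviances instead,
retains has_car because removing it raises the deviance from 6596.0 to
7423.9 and the AIC from 6632.0 to 7457.9, making it one of the two most
informative predictors in the model.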
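The contrast between the Wald test and a deviance-based test for has_car can be made concrete with the figures already printed. This is a sketch only: the likelihood-ratio test is not reported in the original output, and `chi2_sf_1df` is a helper defined here using the closed form for one degree of freedom.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Wald test, from the coefficient table.
estimate, std_error = 18.69323, 201.16320
wald_z = estimate / std_error
wald_p = chi2_sf_1df(wald_z ** 2)  # same as the two-sided normal p-value

# Likelihood-ratio test, from the first stepwise table (has_car has 1 Df).
deviance_full = 6596.0
deviance_without_has_car = 7423.9
lr_stat = deviance_without_has_car - deviance_full
lr_p = chi2_sf_1df(lr_stat)

print(f"Wald z = {wald_z:.3f}, p = {wald_p:.4f}")
print(f"LR statistic = {lr_stat:.1f}, p = {lr_p:.3g}")
```

The Wald p-value reproduces the 0.926 of the summary, whereas the deviance difference of 827.9 on one degree of freedom is overwhelming evidence that car ownership matters.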
Confusion Matrix: Train Set:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2236 483
1 1102 2912
Accuracy : 0.7646
95% CI : (0.7543, 0.7747)
No Information Rate : 0.5042
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5284
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.6699
Specificity : 0.8577
Pos Pred Value : 0.8224
Neg Pred Value : 0.7255
Prevalence : 0.4958
Detection Rate : 0.3321
Detection Prevalence : 0.4038
Balanced Accuracy : 0.7638
'Positive' Class : 0
Confusion Matrix: Test Set:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 945 195
1 485 1259
Accuracy : 0.7642
95% CI : (0.7483, 0.7796)
No Information Rate : 0.5042
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5276
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.6608
Specificity : 0.8659
Pos Pred Value : 0.8289
Neg Pred Value : 0.7219
Prevalence : 0.4958
Detection Rate : 0.3277
Detection Prevalence : 0.3953
Balanced Accuracy : 0.7634
'Positive' Class : 0
The confusion matrices for the logit model lead to the same conclusion.
Accuracy is 76.46% on the train set and 76.42% on the test set, so the
minimal differences in performance suggest that there is no significant
overfitting and that the model generalizes well to unseen data. As with
the random forest, specificity (the recall for Class 1, since the
positive class is 0) exceeds sensitivity (85.77% against 66.99% on the
train set), so there is room for improvement in detecting Class 0. To
address this asymmetry, methods such as class weighting or adjusting
the classification threshold may be beneficial.
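The McNemar p-values in the printouts quantify this asymmetry between the two kinds of error. The sketch below recomputes them for the two test sets in Python, assuming the continuity-corrected form of the test; the function name and argument names are introduced here for illustration.

```python
import math

def mcnemar_p(pred0_ref1, pred1_ref0):
    """McNemar's test with continuity correction on the two off-diagonal counts."""
    b, c = pred0_ref1, pred1_ref0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-squared with 1 degree of freedom: upper tail = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Off-diagonal counts from the two test-set confusion matrices.
rf_stat, rf_p = mcnemar_p(pred0_ref1=240, pred1_ref0=408)
logit_stat, logit_p = mcnemar_p(pred0_ref1=195, pred1_ref0=485)

print(f"random forest: statistic = {rf_stat:.2f}, p = {rf_p:.3e}")
print(f"logit        : statistic = {logit_stat:.2f}, p = {logit_p:.3e}")
```

In both models, Class 0 trips predicted as Class 1 clearly outnumber the opposite error, and the gap is wider for the logit model (485 against 195) than for the random forest (408 against 240).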